MGT 6203: Data Analytics in Business
Team 45
Jeff Hedberg, Lisa Chille, Nick Cunningham, Brittany Lange, Delband Taha
July 06, 2022
A hypothetical auto repair shop may struggle with demand forecasting, since no one plans on getting into an accident. None of its customers have any prior intent of using its services before finding themselves with the challenge of getting their car repaired. By statistically analyzing and drawing inferences from collision data, the shop stands a chance at understanding its customers' patterns and can ready itself by making the appropriate hiring/firing and physical expansion/subletting decisions as necessary. With a little more effort, it can even learn which demographic(s) to raise awareness with, such that in the event of a collision, this repair shop is top of mind for the affected parties.
This project aims at supporting the annual planning process of a hypothetical repair shop in New York state through careful data analysis and inference that leads to recommendations for the auto repair shop owner(s).
Our objective with this project is to use motor vehicle collision reports compiled by the New York City Police Department (NYPD) to help a fictitious auto repair center in New York state estimate the volume of incoming vehicles it can expect to repair in the coming year, assuming its market share is known. Our analysis can help this repair center predict the staffing levels it will need to maintain and identify potential opportunities for expansion. We also aim to analyze the demographics of those involved in collisions and identify which groups contribute significantly to car collisions. We can then propose, and measure the impact of, a marketing campaign for this repair center.
We assume equal market share for the 3164 motor vehicle repair shops found in New York City. This gives us an average market share of 0.0316% per shop, which we will use to estimate the volume of incoming vehicles they can expect.
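As a quick arithmetic check on that figure (a minimal sketch; the shop count is the assumption stated above):

```r
# Equal-share assumption: each of the 3164 NYC repair shops serves 1/3164 of demand
number_of_repair_shops <- 3164
share_fraction <- 1 / number_of_repair_shops  # ~0.000316 as a fraction
share_percent <- 100 * share_fraction         # ~0.0316 as a percentage
round(share_percent, 4)
```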
Research conducted by Songtao He et al. at MIT's Computer Science and Artificial Intelligence Laboratory and the Qatar Center for Artificial Intelligence helped us set direction for this project. Using crash data from four US cities (Los Angeles, New York City, Chicago, and Boston), they developed a model that predicts the number of future crashes in high-risk areas on a high-resolution map, combining historical crash data with GPS traces, satellite images, and road maps. These scientists built their model on crash data from 2017 and 2018, and used the next two years to test its performance at predicting crashes. Ultimately, since we wanted to build models for a potential real-world business application, we chose a project with a fictitious business, and we chose to follow a similar methodology for splitting our data and evaluating and testing our models.
Our approach, and the models we intend to use, are detailed in the remainder of this report.
First, we import the necessary libraries. These libraries will be used as follows:

| Library | Use |
|---|---|
| dplyr | Data manipulation |
| knitr | RMarkdown knitting |
| kableExtra | Kable tables |
| plotly | Plotting |
| lubridate | Date manipulation |
| randomForest | Building random forest models |
| rpart | Building CART models |
| rpart.plot | Plotting the regression trees built |
library(dplyr)
library(knitr)
library(kableExtra)
library(plotly)
library(lubridate)
library(randomForest)
library(rpart)
library(rpart.plot)
Below, we import data from 3 sources and view the raw summaries. The data sets relate to one another as shown in the data-relationships diagram below.
The following helper functions will be reused to read and summarise the data at hand.
read_data <- function(path) {
data <- read.csv(path, stringsAsFactors = FALSE)
colnames(data)[colnames(data) == 'CRASH.DATE'] <- 'CRASH_DATE'
data$CRASH_DATE <- as.Date(data$CRASH_DATE, "%m/%d/%Y")
data
}
tabulate_collisions <- function(data) {
kable(t(summary(data))) %>% kable_classic(full_width = TRUE, html_font = "Cambria",
font_size = 14)
}
Crashes

crashes_df <- read_data('./Crashes.csv') #1,896,229 x 29
tabulate_collisions(crashes_df)
| CRASH_DATE | Min. :2012-07-01 | 1st Qu.:2014-10-28 | Median :2016-12-15 | Mean :2017-01-01 | 3rd Qu.:2019-01-04 | Max. :2022-05-29 | NA |
| CRASH.TIME | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| BOROUGH | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| ZIP.CODE | Min. :10000 | 1st Qu.:10306 | Median :11207 | Mean :10837 | 3rd Qu.:11237 | Max. :11697 | NA’s :587695 |
| LATITUDE | Min. : 0.00 | 1st Qu.:40.67 | Median :40.72 | Mean :40.64 | 3rd Qu.:40.77 | Max. :43.34 | NA’s :220042 |
| LONGITUDE | Min. :-201.36 | 1st Qu.: -73.98 | Median : -73.93 | Mean : -73.77 | 3rd Qu.: -73.87 | Max. : 0.00 | NA’s :220042 |
| LOCATION | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| ON.STREET.NAME | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| CROSS.STREET.NAME | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| OFF.STREET.NAME | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| NUMBER.OF.PERSONS.INJURED | Min. : 0.0000 | 1st Qu.: 0.0000 | Median : 0.0000 | Mean : 0.2873 | 3rd Qu.: 0.0000 | Max. :43.0000 | NA’s :18 |
| NUMBER.OF.PERSONS.KILLED | Min. :0.000000 | 1st Qu.:0.000000 | Median :0.000000 | Mean :0.001358 | 3rd Qu.:0.000000 | Max. :8.000000 | NA’s :31 |
| NUMBER.OF.PEDESTRIANS.INJURED | Min. : 0.00000 | 1st Qu.: 0.00000 | Median : 0.00000 | Mean : 0.05304 | 3rd Qu.: 0.00000 | Max. :27.00000 | NA |
| NUMBER.OF.PEDESTRIANS.KILLED | Min. :0.000000 | 1st Qu.:0.000000 | Median :0.000000 | Mean :0.000697 | 3rd Qu.:0.000000 | Max. :6.000000 | NA |
| NUMBER.OF.CYCLIST.INJURED | Min. :0.00000 | 1st Qu.:0.00000 | Median :0.00000 | Mean :0.02435 | 3rd Qu.:0.00000 | Max. :4.00000 | NA |
| NUMBER.OF.CYCLIST.KILLED | Min. :0.0000000 | 1st Qu.:0.0000000 | Median :0.0000000 | Mean :0.0001007 | 3rd Qu.:0.0000000 | Max. :2.0000000 | NA |
| NUMBER.OF.MOTORIST.INJURED | Min. : 0.0000 | 1st Qu.: 0.0000 | Median : 0.0000 | Mean : 0.2083 | 3rd Qu.: 0.0000 | Max. :43.0000 | NA |
| NUMBER.OF.MOTORIST.KILLED | Min. :0.00000 | 1st Qu.:0.00000 | Median :0.00000 | Mean :0.00055 | 3rd Qu.:0.00000 | Max. :5.00000 | NA |
| CONTRIBUTING.FACTOR.VEHICLE.1 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING.FACTOR.VEHICLE.2 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING.FACTOR.VEHICLE.3 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING.FACTOR.VEHICLE.4 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING.FACTOR.VEHICLE.5 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| COLLISION_ID | Min. : 22 | 1st Qu.:3046695 | Median :3584305 | Mean :3021392 | 3rd Qu.:4058626 | Max. :4533068 | NA |
| VEHICLE.TYPE.CODE.1 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE.TYPE.CODE.2 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE.TYPE.CODE.3 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE.TYPE.CODE.4 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE.TYPE.CODE.5 | Length:1896229 | Class :character | Mode :character | NA | NA | NA | NA |
Person

person_df <- read_data('./Person.csv') #4,692,054 x 21
tabulate_collisions(person_df)
| UNIQUE_ID | Min. : 10922 | 1st Qu.: 6812186 | Median : 8996148 | Mean : 8531863 | 3rd Qu.:10216281 | Max. :12239058 | NA |
| COLLISION_ID | Min. : 37 | 1st Qu.:3638855 | Median :3921823 | Mean :3853306 | 3rd Qu.:4210666 | Max. :4533068 | NA |
| CRASH_DATE | Min. :2012-07-01 | 1st Qu.:2017-03-19 | Median :2018-06-08 | Mean :2018-07-08 | 3rd Qu.:2019-09-20 | Max. :2022-05-29 | NA |
| CRASH_TIME | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PERSON_ID | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PERSON_TYPE | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PERSON_INJURY | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_ID | Min. : 123423 | 1st Qu.:17466247 | Median :18528882 | Mean :18253620 | 3rd Qu.:19125401 | Max. :20229580 | NA’s :185684 |
| PERSON_AGE | Min. :-999.0 | 1st Qu.: 23.0 | Median : 35.0 | Mean : 36.8 | 3rd Qu.: 50.0 | Max. :9999.0 | NA’s :453265 |
| EJECTION | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| EMOTIONAL_STATUS | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| BODILY_INJURY | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| POSITION_IN_VEHICLE | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| SAFETY_EQUIPMENT | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PED_LOCATION | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PED_ACTION | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| COMPLAINT | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PED_ROLE | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING_FACTOR_1 | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING_FACTOR_2 | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
| PERSON_SEX | Length:4692054 | Class :character | Mode :character | NA | NA | NA | NA |
Vehicles

vehicles_df <- read_data('./Vehicles.csv') #3,704,406 x 25
tabulate_collisions(vehicles_df)
| UNIQUE_ID | Min. : 111711 | 1st Qu.:14215160 | Median :17306058 | Mean :16060871 | 3rd Qu.:18739205 | Max. :20121717 | NA |
| COLLISION_ID | Min. : 22 | 1st Qu.:3017853 | Median :3567068 | Mean :2996659 | 3rd Qu.:4028145 | Max. :4484197 | NA |
| CRASH_DATE | Min. :2012-07-01 | 1st Qu.:2014-10-15 | Median :2016-11-18 | Mean :2016-11-21 | 3rd Qu.:2018-11-15 | Max. :2021-12-04 | NA |
| CRASH_TIME | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_ID | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| STATE_REGISTRATION | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_TYPE | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_MAKE | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_MODEL | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_YEAR | Min. : 1000 | 1st Qu.: 2008 | Median : 2013 | Mean : 2015 | 3rd Qu.: 2016 | Max. :20063 | NA’s :1796971 |
| TRAVEL_DIRECTION | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_OCCUPANTS | Min. :0.00e+00 | 1st Qu.:1.00e+00 | Median :1.00e+00 | Mean :1.01e+03 | 3rd Qu.:1.00e+00 | Max. :1.00e+09 | NA’s :1718406 |
| DRIVER_SEX | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| DRIVER_LICENSE_STATUS | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| DRIVER_LICENSE_JURISDICTION | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| PRE_CRASH | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| POINT_OF_IMPACT | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_DAMAGE | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_DAMAGE_1 | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_DAMAGE_2 | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| VEHICLE_DAMAGE_3 | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| PUBLIC_PROPERTY_DAMAGE | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| PUBLIC_PROPERTY_DAMAGE_TYPE | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING_FACTOR_1 | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
| CONTRIBUTING_FACTOR_2 | Length:3704406 | Class :character | Mode :character | NA | NA | NA | NA |
As part of a traffic safety initiative to eliminate traffic fatalities, the NYPD replaced its record management system with an electronic one (FORMS) in 2016 (“Motor Vehicle Collisions - Crashes”). We see this change reflected in the chart below.
monthly_agg1 <- (person_df %>%
mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>%
group_by(yr_mo) %>%
summarise(n = n()) %>%
arrange(yr_mo)
)
plot_ly(x=monthly_agg1$yr_mo, y=monthly_agg1$n, type='bar') %>%
layout(title = "Crash Data by Month from 2012-2022", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))
Note that the amount of data collected from March 2016 onward greatly surpasses the amount previously collected. To control for the change in data recording systems, we will use only data collected from January 1st, 2017 onward for full-year modeling.
We are now ready to filter the data to select columns of interest. We will also combine data from our three sources into one place, using their common factors. A summary can be found below.
combined_df <- (crashes_df %>%
select(-c(CRASH_DATE, CRASH.TIME, CONTRIBUTING.FACTOR.VEHICLE.1, CONTRIBUTING.FACTOR.VEHICLE.2,
CONTRIBUTING.FACTOR.VEHICLE.3, CONTRIBUTING.FACTOR.VEHICLE.4, CONTRIBUTING.FACTOR.VEHICLE.5,
VEHICLE.TYPE.CODE.4, VEHICLE.TYPE.CODE.5,
NUMBER.OF.PEDESTRIANS.INJURED, NUMBER.OF.PEDESTRIANS.KILLED,
NUMBER.OF.CYCLIST.INJURED, NUMBER.OF.CYCLIST.KILLED,
ON.STREET.NAME, CROSS.STREET.NAME, OFF.STREET.NAME)) %>%
inner_join(vehicles_df %>%
filter(!is.na(VEHICLE_ID)) %>%
select(-c(CRASH_DATE, CRASH_TIME, VEHICLE_ID))
, by = 'COLLISION_ID') %>%
inner_join(person_df %>%
select(-c(UNIQUE_ID, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2))
, by= c('COLLISION_ID'='COLLISION_ID', 'UNIQUE_ID'='VEHICLE_ID')) %>%
filter((PED_ROLE %in% c('Driver'))) %>% #drivers only
filter((CRASH_DATE >= '2017-01-01') & (CRASH_DATE < '2021-12-01')) %>% #only from 2017-01-01 to 2021-11-30
filter(PERSON_AGE > 14 & PERSON_AGE< 101) %>%
filter(PERSON_SEX == 'F' | PERSON_SEX=='M') %>%
mutate(MONTH = substr(CRASH_DATE,6,7)) %>%
mutate(TIMESTEP = as.period(interval(as.Date('2017-01-01'), CRASH_DATE)) %/% months(1)) %>%
mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>%
select(COLLISION_ID, CRASH_DATE, PERSON_AGE, PERSON_SEX, MONTH, TIMESTEP, yr_mo) %>%
arrange(CRASH_DATE)
)
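As a brief illustration of the TIMESTEP encoding used in the pipeline above (a sketch using lubridate, already loaded): it counts whole months elapsed since 2017-01-01.

```r
library(lubridate)

origin <- as.Date('2017-01-01')
# Integer-divide the elapsed period by one month to get whole months
as.period(interval(origin, as.Date('2017-01-15'))) %/% months(1)  # 0  (same month)
as.period(interval(origin, as.Date('2018-03-15'))) %/% months(1)  # 14 (1 year + 2 months)
```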
kable(t(summary(combined_df))) %>% kable_classic(full_width = TRUE, html_font = "Cambria", font_size = 14)
| COLLISION_ID | Min. :3589945 | 1st Qu.:3805149 | Median :4017960 | Mean :4023043 | 3rd Qu.:4234930 | Max. :4484186 |
| CRASH_DATE | Min. :2017-01-01 | 1st Qu.:2017-12-04 | Median :2018-11-05 | Mean :2018-12-28 | 3rd Qu.:2019-11-01 | Max. :2021-11-30 |
| PERSON_AGE | Min. : 15.00 | 1st Qu.: 29.00 | Median : 40.00 | Mean : 41.69 | 3rd Qu.: 53.00 | Max. :100.00 |
| PERSON_SEX | Length:1339106 | Class :character | Mode :character | NA | NA | NA |
| MONTH | Length:1339106 | Class :character | Mode :character | NA | NA | NA |
| TIMESTEP | Min. : 0.00 | 1st Qu.:11.00 | Median :22.00 | Mean :23.44 | 3rd Qu.:34.00 | Max. :58.00 |
| yr_mo | Length:1339106 | Class :character | Mode :character | NA | NA | NA |
We create our aggregate of volume data available for modeling and visualize it with a plot. It is interesting to note the impact of COVID-19 on crash volume beginning in March 2020.
monthly_agg2 <- (combined_df %>%
group_by(TIMESTEP, yr_mo, MONTH) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(TIMESTEP, yr_mo, MONTH)
)
plot_ly(x=monthly_agg2$yr_mo, y=monthly_agg2$n, type='bar') %>%
layout(title = "Crash Data by Month from 2017-2021", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))
We also create a more granular aggregate for modeling, with demographic feature groups, and separate our data into training and test data sets. Since we are dealing with a temporal model, we select all but the final year for training and reserve the final year for testing. A random split would be nonsensical for our purposes.
monthly_agg3 <- (combined_df %>%
group_by(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE) %>%
summarise(n = n(), .groups = 'drop') %>%
arrange(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE)
)
train_df <- monthly_agg3 %>% filter(TIMESTEP <= 47) # <= 2020-12
test_df <- monthly_agg3 %>% filter(TIMESTEP > 47) # > 2020-12
lm_model <- lm(n~TIMESTEP+MONTH+PERSON_AGE+PERSON_SEX, data = train_df)
lm_summary <- summary(lm_model)
lm_summary
##
## Call:
## lm(formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -441.83 -64.83 4.23 58.78 299.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 352.82598 5.76802 61.169 < 2e-16 ***
## TIMESTEP -2.92530 0.09545 -30.646 < 2e-16 ***
## MONTH02 -10.16351 6.23069 -1.631 0.102889
## MONTH03 5.85024 6.23280 0.939 0.347955
## MONTH04 -17.30059 6.27380 -2.758 0.005837 **
## MONTH05 13.68088 6.25906 2.186 0.028863 *
## MONTH06 19.15099 6.25343 3.062 0.002203 **
## MONTH07 13.20462 6.25123 2.112 0.034691 *
## MONTH08 12.80755 6.26343 2.045 0.040908 *
## MONTH09 18.12040 6.27354 2.888 0.003883 **
## MONTH10 27.72229 6.28431 4.411 1.04e-05 ***
## MONTH11 22.61062 6.30315 3.587 0.000336 ***
## MONTH12 23.03260 6.32483 3.642 0.000273 ***
## PERSON_AGE -3.92660 0.05504 -71.344 < 2e-16 ***
## PERSON_SEXM 150.90222 2.55012 59.175 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 111.3 on 7619 degrees of freedom
## Multiple R-squared: 0.5457, Adjusted R-squared: 0.5449
## F-statistic: 653.8 on 14 and 7619 DF, p-value: < 2.2e-16
R_sq_lm_train <- lm_summary$r.squared
R_sq_lm_train
## [1] 0.5457191
SST_lm_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_lm_train <- sum((train_df$n - predict(lm_model, newdata = train_df))^2)
R_sq_lm_train <- 1-(SSE_lm_train/SST_lm_train)
R_sq_lm_train
## [1] 0.5457191
predictions <- predict(lm_model, newdata = train_df)
plot_ly(x=train_df$yr_mo, y=(predictions-train_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Linear Model Residuals vs yr_mo for Training Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'LM Residuals', rangemode = "tozero"))
SST_lm_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_lm_test <- sum((test_df$n - predict(lm_model, newdata = test_df))^2)
R_sq_lm_test <- 1-(SSE_lm_test/SST_lm_test)
R_sq_lm_test
## [1] 0.08364779
plot_ly(x=test_df$yr_mo, y=(predict(lm_model, newdata = test_df)-test_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Linear Model Residuals vs yr_mo for Test Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'LM Residuals', rangemode = "tozero"))
Our linear model achieved an R-squared value of 0.55 on training data, while its performance on test data dropped the R-squared value to 0.08.
set.seed(12345)
rf_model <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, importance=TRUE, ntree=110)
summary(rf_model)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 7634 -none- numeric
## mse 110 -none- numeric
## rsq 110 -none- numeric
## oob.times 7634 -none- numeric
## importance 8 -none- numeric
## importanceSD 4 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 7634 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
var_importance_df <- data.frame(importance(rf_model)) %>% rename('PCT_IncMSE'='X.IncMSE') %>% arrange(desc(PCT_IncMSE))
var_importance_df
SST_rf_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_rf_train <- sum((train_df$n - predict(rf_model, newdata = train_df))^2)
R_sq_rf_train <- 1-(SSE_rf_train/SST_rf_train)
R_sq_rf_train
## [1] 0.8639321
plot_ly(x=seq(1,length(rf_model$mse),1), y=rf_model$mse, mode='lines+markers', type='scatter') %>%
layout(title = 'Plot of Random Forest Training Error vs Number of Trees',
xaxis = list(title = 'Number of Trees'),
yaxis = list(title = 'Error', rangemode = "tozero"))
plot_ly(x=train_df$yr_mo, y=(predict(rf_model, newdata = train_df)-train_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Random Forest Model Residuals vs yr_mo',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'RF Residuals', rangemode = "tozero"))
SST_rf_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_rf_test <- sum((test_df$n - predict(rf_model, newdata = test_df))^2)
R_sq_rf_test <- 1-(SSE_rf_test/SST_rf_test)
R_sq_rf_test
## [1] 0.7582554
plot_ly(x=test_df$yr_mo, y=(predict(rf_model, newdata = test_df)-test_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of Random Forest Model Residuals vs yr_mo for Test Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'RF Residuals', rangemode = "tozero"))
Our random forest model performed much better, giving us an R-squared value of 0.86 on training data and 0.76 on test data.
set.seed(12345)
min_leaf <- 3
min_split <- 3*min_leaf
cart_model <- rpart(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df,
control = c(minsplit = min_split, minbucket = min_leaf, cp=0.01)) #default = 0.01
SST_CART_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_CART_train <- sum((train_df$n - predict(cart_model, newdata = train_df))^2)
R_sq_cart_train <- 1-(SSE_CART_train/SST_CART_train)
R_sq_cart_train
## [1] 0.9023338
data.frame(Variable_Importance = cart_model$variable.importance,
Variable_Importance_Pct_Tot = round(100*cart_model$variable.importance/sum(cart_model$variable.importance),0))
rpart.plot(cart_model)
plot_ly(x=train_df$yr_mo, y=(predict(cart_model, newdata = train_df)-train_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of CART Model Residuals vs yr_mo',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'CART Residuals', rangemode = "tozero"))
SST_cart_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_cart_test <- sum((test_df$n - predict(cart_model, newdata = test_df))^2)
R_sq_cart_test <- 1-(SSE_cart_test/SST_cart_test)
R_sq_cart_test
## [1] 0.65263
plot_ly(x=test_df$yr_mo, y=(predict(cart_model, newdata = test_df)-test_df$n), type='scatter', mode='markers') %>%
layout(title = 'Plot of CART Model Residuals vs yr_mo for Test Data',
xaxis = list(title = 'yr_mo'),
yaxis = list(title = 'CART Residuals', rangemode = "tozero"))
We see with the CART model that the R-squared value for training data is 0.90, but we see a drop on test data to 0.65.
Each model’s performance can be compared in the following plots.
plot_ly(x=train_df$n, y=(predict(cart_model, newdata = train_df)), type='scatter', mode='markers', name='CART', alpha=0.3) %>%
add_trace(x=train_df$n, y=(predict(rf_model, newdata = train_df)), type='scatter', mode='markers', name='RF', alpha=0.3) %>%
add_trace(x=train_df$n, y=(predict(lm_model, newdata = train_df)), type='scatter', mode='markers', name='LM', alpha=0.3) %>%
add_trace(x=c(0,500), y=c(0,500), type='scatter', mode='lines', name='Perfect', alpha=1) %>%
layout(title = 'Plot of All Model Predictions vs Actual Values - Training Data',
xaxis = list(title = 'Actual Value', range=c(0,700)),
yaxis = list(title = 'Model Prediction', rangemode = "tozero"))
plot_ly(x=test_df$n, y=(predict(cart_model, newdata = test_df)), type='scatter', mode='markers', name='CART', alpha=0.3) %>%
add_trace(x=test_df$n, y=(predict(rf_model, newdata = test_df)), type='scatter', mode='markers', name='RF', alpha=0.3) %>%
add_trace(x=test_df$n, y=(predict(lm_model, newdata = test_df)), type='scatter', mode='markers', name='LM', alpha=0.3) %>%
add_trace(x=c(0,300), y=c(0,300), type='scatter', mode='lines', name='Perfect', alpha=1) %>%
layout(title = 'Plot of All Model Predictions vs Actual Values - Test Data',
xaxis = list(title = 'Actual Value', range=c(0,325)),
yaxis = list(title = 'Model Prediction', rangemode = "tozero"))
kable(data.frame(Model_Type = c('Linear', 'Random Forest', 'CART'),
R_sq_train = round(c(R_sq_lm_train, R_sq_rf_train, R_sq_cart_train), 2),
R_sq_test = round(c(R_sq_lm_test, R_sq_rf_test, R_sq_cart_test), 2))) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")%>%
row_spec(2, background = c("yellow"))
| Model_Type | R_sq_train | R_sq_test |
|---|---|---|
| Linear | 0.55 | 0.08 |
| Random Forest | 0.86 | 0.76 |
| CART | 0.90 | 0.65 |
We can see that our random forest model has the best performance on the test data, so we select it moving forward. We will now estimate the monthly volume our auto body repair center can expect in the future, using our expected market share of 0.0316%.
We continue under the assumption that our repair shop operates in perfect competition, meaning it holds an equal share of demand, i.e. 0.0316%. With this assumption in tow, we estimate the shop's future monthly car volume.
First, we formally compute the market share of the shop.
number_of_repair_shops <- 3164
Shop_Market_Share <- 3164/(100*number_of_repair_shops)
Now, we use the test data that we had set aside to predict the monthly collision volume.
test_agg_df <- test_df %>%
group_by(MONTH) %>%
summarise(Actual_NYC_Collisions=sum(n), Actual_Shop_Volume=round(Shop_Market_Share*sum(n), 0))
Here, we create a dataset covering the future year (2021).
temp1 <- data.frame(TIMESTEP = c(48:59))
temp2 <- data.frame(MONTH = c('01','02','03','04','05','06','07','08','09','10','11','12'))
temp3 <- data.frame(PERSON_AGE = c(15:100))
temp4 <- data.frame(PERSON_SEX = c('M','F'))
monthly_predictions_df <- temp1 %>%
bind_cols(temp2) %>%
full_join(temp3, by=character()) %>%
full_join(temp4, by=character())
And finally, we make predictions.
monthly_predictions_df$predictions <- predict(rf_model, newdata = monthly_predictions_df)
monthly_predictions_agg_df <- (monthly_predictions_df %>%
group_by(MONTH) %>%
summarise(Predicted_NYC_Collisions=round(sum(predictions), 0)) %>%
mutate(Predicted_Shop_Volume = round(Shop_Market_Share*Predicted_NYC_Collisions, 0)) %>%
left_join(test_agg_df, by='MONTH') %>%
mutate(YEAR = 2021) %>%
select(YEAR, MONTH, Actual_NYC_Collisions, Actual_Shop_Volume, Predicted_NYC_Collisions, Predicted_Shop_Volume)
)
kable(monthly_predictions_agg_df) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
| YEAR | MONTH | Actual_NYC_Collisions | Actual_Shop_Volume | Predicted_NYC_Collisions | Predicted_Shop_Volume |
|---|---|---|---|---|---|
| 2021 | 01 | 9671 | 97 | 16482 | 165 |
| 2021 | 02 | 8432 | 84 | 15864 | 159 |
| 2021 | 03 | 10648 | 106 | 15574 | 156 |
| 2021 | 04 | 11501 | 115 | 12726 | 127 |
| 2021 | 05 | 13394 | 134 | 14655 | 147 |
| 2021 | 06 | 13745 | 137 | 14868 | 149 |
| 2021 | 07 | 12645 | 126 | 15234 | 152 |
| 2021 | 08 | 12473 | 125 | 15381 | 154 |
| 2021 | 09 | 12623 | 126 | 15488 | 155 |
| 2021 | 10 | 13097 | 131 | 15758 | 158 |
| 2021 | 11 | 11522 | 115 | 15324 | 153 |
| 2021 | 12 | NA | NA | 15081 | 151 |
plot_ly(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Actual_Shop_Volume,
type='bar', name='Actual_Shop_Volume') %>%
add_trace(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Predicted_Shop_Volume,
type='bar', name='Predicted_Shop_Volume') %>%
layout(title = 'Plot of actual and predicted shop volume (Test Set)',
xaxis = list(title = 'Months in 2021'),
yaxis = list(title = 'Car Volume'))
We can see that our model does quite a good job predicting the shop volume compared to the actual volume the shop saw in 2021. We hypothesize that our model would have performed even better without the unforeseen consequences of COVID-19 and the subsequent increase in remote work availability.
Our next task is to start answering our secondary research objectives, including identifying key demographics for use in a marketing campaign and then measuring its cost and effect.
Crash counts by gender (training data)
kable(train_df %>%
group_by(PERSON_SEX) %>%
summarise(n = sum(n)) %>%
arrange(PERSON_SEX)) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
| PERSON_SEX | n |
|---|---|
| F | 313706 |
| M | 895649 |
Crash counts by age group (training data)
kable(train_df %>%
mutate(age_group = 5*floor(PERSON_AGE/5)) %>%
group_by(age_group) %>%
summarise(n = sum(n)) %>%
arrange(age_group)) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
| age_group | n |
|---|---|
| 15 | 26719 |
| 20 | 113808 |
| 25 | 161005 |
| 30 | 155614 |
| 35 | 137152 |
| 40 | 122047 |
| 45 | 116100 |
| 50 | 111292 |
| 55 | 98774 |
| 60 | 73487 |
| 65 | 44235 |
| 70 | 25668 |
| 75 | 13053 |
| 80 | 6627 |
| 85 | 2720 |
| 90 | 834 |
| 95 | 197 |
| 100 | 23 |
An analysis done by researchers in the UK of road accident data reported in Great Britain found that adjusted crash risk peaked at ages 21-29 and gradually declined as age increased. They also found a relationship between the age of the driver and the time of day an accident occurred, namely that teenage drivers were at a lesser risk of crash than other age groups (Regev, Rolison, & Moutari). Moving forward, we plan to leverage our data to determine the demographics most likely to be involved in a car crash. We expect to use multiple linear regression with squared predictors, standardized to mean 0 and standard deviation 1. We will then determine the cost of a potential promotion according to average clickthrough rates for search ads and display ads (1.91% and 0.35%, respectively) (Volovich). We will then outline the potential increase in market share and apply these outputs to a costing model for our auto repair shop to derive profits and return on assets for future investments, potentially using linear regression.
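A minimal sketch of the demographic regression we have in mind, under stated assumptions: it reuses the monthly_agg3 aggregate from above, and the variable names age_std and age_std_sq are illustrative, not part of our current code.

```r
# Standardize age to mean 0 / sd 1, then add a squared term so the model can
# capture a peak-then-decline crash pattern over age.
demo_df <- monthly_agg3 %>%
  mutate(age_std = as.numeric(scale(PERSON_AGE)),  # (x - mean) / sd
         age_std_sq = age_std^2)

demo_model <- lm(n ~ age_std + age_std_sq + PERSON_SEX, data = demo_df)
summary(demo_model)
```

A negative coefficient on age_std_sq would indicate a concave (peaked) age effect, consistent with the UK findings.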
To recap, for the next milestone we intend to identify the key demographics most likely to be involved in a crash, propose a marketing campaign targeting them, and measure that campaign's cost and impact on our shop's market share.